<a href="https://www.nlpfromscratch.com?utm_source=notebook&utm_medium=nb-header"><center><img src="../assets/coverimage_PT1.png"></center></a>

# Introduction to Large Language Models and Generative Text

Copyright, NLP from scratch, 2025.

[LLMSfor.me](https://llmsfor.me)

------------

## Introduction 🎬
In this notebook, we will explore Large Language Models (LLMs) for generative text, and show how they can be leveraged using the open source libraries from [Hugging Face](https://huggingface.co/).

This notebook is best run in [Google Colab](https://colab.research.google.com/), where the majority of dependencies are already installed. However, if you wish to run the notebook locally, please follow the [directions for setting up a local environment](https://drive.google.com/file/d/1EV1seK-dUHRCzj2EDuu3ETAhUyjzOGRd/view?usp=drive_link); you may then download the notebook as a `.ipynb` file and run it in either Jupyter or JupyterLab.

Since we will be using a GPU in this notebook for compute-intensive tasks, please ensure that, if running on Colab, the runtime type is set to GPU. In the menu in Colab, select *Runtime -> Change runtime type*, then select T4 GPU (if using Colab Free) or another GPU instance type if using Colab Pro.

<center><img src="../assets/gpu_colab.png" width="50%"/></center><br/>

Though Google Colab comes with many useful data science libraries included by default (including PyTorch), the Hugging Face libraries are not all among them, so we will first install them here using `pip`, as they will be used in the remainder of the notebook:

- The `transformers` library, for general usage of transformer models
- The `datasets` library, for working with datasets hosted on Hugging Face
- The `accelerate` library, for using GPU for inference
- The `evaluate` library, for metrics for measuring model performance in training
- The `bitsandbytes` library for model quantization
- The `peft` library, for efficient fine-tuning of models in the second half of the workshop
- The `huggingface_hub` library, for interacting with models on the Hugging Face hub

We will also be using custom datasets from the NLP from scratch [github repo](https://github.com/nlpfromscratch/datasets/) and so we will clone this repo to have these all available locally.



In [None]:
!git clone https://github.com/nlpfromscratch/datasets.git

Cloning into 'datasets'...
remote: Enumerating objects: 70, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (62/62), done.
remote: Total 70 (delta 14), reused 61 (delta 8), pack-reused 0
Receiving objects: 100% (70/70), 34.61 MiB | 16.89 MiB/s, done.
Resolving deltas: 100% (14/14), done.
Updating files: 100% (27/27), done.


In [None]:
!pip install transformers datasets accelerate evaluate bitsandbytes peft huggingface_hub

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting accelerate
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.2.post2-py3-none-any.whl (92.6 MB)

## Large Language Models (LLMs) and the Transformer Architecture

What is a large language model (LLM)? While there is no universally accepted definition, large language models are a type of deep learning model understood to be both very large in size (number of parameters) and trained on very large datasets. The datasets used to train LLMs have grown steadily in size, and it is not unusual for them to represent double-digit percentages of the web.

LLMs are a type of *deep learning* model, also known as a neural network, a type of machine learning model loosely inspired by the structure of the human brain. Traditional deep learning models take inputs - structured data found in rows and columns like in a database, or unstructured data like images, video, audio, or, in the case of natural language processing (NLP) models, free-form text - and use these to make predictions about a target variable associated with each observation (row) in the input data. For example, in the case of a deep learning model for computer vision, the input could be a dataset of images and the output a predicted label for each image (which Japanese character is this?).

<center><img src="../assets/ann_diagram.png"/></center>

Deep learning models are composed of *layers*, each of which is composed of *nodes*, and each of the nodes has inputs which come from previous layers and associated *weights* which are multiplied by each of the inputs. Collectively, these weights are referred to as the model *parameters* we mentioned earlier, and what is learned in the "learning" of deep learning is the optimal values for these numeric parameters to best predict the outcome.
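As a rough illustration of the paragraph above, here is a minimal sketch of a single fully connected layer in plain Python. The input values, weights, bias terms, and the ReLU activation are all made-up assumptions for this example and are not taken from any real model.

```python
# Toy forward pass through one fully connected layer (illustrative values only)

def dense_layer(inputs, weights, biases):
    """Each node multiplies every input by its weight, sums, adds a bias, then applies ReLU."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU activation (an assumption for this sketch)
    return outputs

inputs = [1.0, 2.0, 3.0]            # three input values from the previous layer
weights = [[0.5, -1.0, 0.25],       # weights for node 1
           [1.0, 1.0, 1.0]]         # weights for node 2
biases = [0.0, -1.0]                # one bias per node

layer_output = dense_layer(inputs, weights, biases)

# The layer's parameters are simply all of its weights and biases
n_parameters = sum(len(row) for row in weights) + len(biases)

print(layer_output, n_parameters)
```

Training a network amounts to adjusting numbers like those in `weights` and `biases`; an LLM does the same thing with millions or billions of them.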

Large language models are a category of deep learning models with the following properties:
- They primarily work with language data (*i.e.* text), as input, output, or both - either solely or in conjunction with other data types (images, audio, video, etc.), in which case the LLMs are referred to as *multimodal*.
- As mentioned above, LLMs are understood to be very large both in model size (hundreds of millions, billions, hundreds of billions, or even *trillions* of parameters) and in the size of the datasets they are trained on, which comprise hundreds of millions or billions of tokens.

It is from the latter of these two properties that the remarkable capabilities of recent LLMs have arisen. There is one other important fundamental development which led to the rise of LLMs as we know them today, and that is the transformer architecture.


### The Transformer Architecture

We refer to the structure of a deep learning network as its *architecture* - the example shown in the previous section is that of the simplest type of deep learning model architecture, referred to as a *feed-forward* or *fully connected* neural network, since the outputs of each node in the hidden layers serve as input for each node in the following layer.

There are many different types of more complex and specialized architectures for specific tasks and use cases in deep learning. These have arisen from years of research in academia and application in industry. For example, convolutional neural networks (CNNs) are the standard for computer vision tasks and have different types of layers specifically suited to them.

The Transformer architecture is an entirely new type of neural network architecture on which the vast majority of (but not all) large language models are based. The Transformer was introduced fairly recently in the paper [Attention is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017) by researchers working at Google. As an interesting side note, contributors to the paper have moved on to notable AI startups - Aidan Gomez went on to be one of the founders of the OpenAI competitor [Cohere](https://en.wikipedia.org/wiki/Cohere) and Noam Shazeer to co-found [Character.ai](https://en.wikipedia.org/wiki/Character.ai). While the transformer architecture was originally devised for a text-to-text translation task (English and German) and trained on what is now a comparatively small dataset, it has since been found to be broadly applicable and highly performant across a very wide variety of natural language processing tasks and represents the state of the art for most of these, ranging from text generation, which we will cover in this workshop, to other applications such as image generation, as with OpenAI's [DALL-E](https://en.wikipedia.org/wiki/DALL-E) models.

<center><img src="../assets/transformer_architecture.png" width="50%"/>
<br/>
<caption> The Transformer Architecture, as originally presented in the Attention is All You Need paper </caption></center>

While the mathematical and technical details of the transformer are very complex, we will not dive too deeply into these here. There are many resources available which cover them, for example [Transformers from Scratch](https://peterbloem.nl/blog/transformers) and [minGPT](https://github.com/karpathy/minGPT), a reimplementation of the GPT model by Andrej Karpathy, formerly of Tesla and OpenAI.

At a high level, there are a couple of important points to note before we dive into working with LLMs in code.


### Encoder and Decoder Models

The transformer model is made up of two large blocks: the *encoder* block on the left, and the *decoder* block on the right. While the original transformer architecture comprised both of these, there are now models which are composed of stacked blocks of just one type. For example, models in the literature can be *encoder-only* or *decoder-only* models, or a full transformer with both encoder and decoder. It should be noted that, in practice, these would all still be referred to as transformer models even though the former two are not "full" transformers based upon the strict definition of the architecture.

Encoder and decoder models have different tasks to which they are well-suited. Generally speaking, encoder models take text as input and produce a higher dimensional representation of the dataset (corpus) of text - an *embedding* - on which they are trained. You will also sometimes hear encoder models referred to as *autoencoding* models, as they perform a similar task to the traditional [autoencoder model](https://en.wikipedia.org/wiki/Autoencoder) in deep learning.

<br/>
<center><img src="../assets/types_of_transformers.png" width="75%"/>
<br/>
<caption> Types of Transformers. Image Credit: <a href="https://www.comet.com/site/blog/explainable-ai-for-transformers/">Abby Morgan</a></caption>
</center>

Decoder models, on the other hand, take inputs and produce output probabilities. Most commonly, these are known for doing text generation, where the model takes a sequence of text as input and makes predictions about the most likely words to come next, as made famous by the [Generative Pretrained Transformer (GPT)](https://en.wikipedia.org/wiki/Generative_pre-trained_transformer) model by OpenAI which we will work with shortly. You will also hear decoder-only models referred to as *autoregressive* models, as they take their own outputs as inputs (in order to make predictions about a sequence of text, word-by-word) and use these to predict probabilities for the next word. As we will see shortly, this type of task in natural language processing is also referred to as *causal language modeling*.

## Working with Generative Text Models

### Use Cases for Generative Text Models



**Code autocompletion and AI-assisted coding**:

<img src="../assets/github-copilot-logo.jpg" width="33%"/>

Microsoft's [GitHub Copilot](https://github.com/features/copilot) was launched in June 2022. Initially, more than ¼ of developers' code files on average were generated by GitHub Copilot, and today with widespread adoption this is close to half (~46%), and it has been used by over 1M developers. In October 2023, Copilot [surpassed $100M](https://twitter.com/swyx/status/1711792178031460618) in annual recurring revenue.

**Writing Assistants for creativity and copywriting**:

<img src="../assets/duet_ai.png" width="33%"/>

AI writing assistants have arisen for improved productivity and content creation for marketing, sales, creative, and numerous other areas. For example, Google has made this a part of their core offerings with their announcement of [Duet AI](https://workspace.google.com/solutions/ai/) and Canva has introduced [Magic Write](https://www.canva.com/magic-write/) based upon OpenAI's offerings.

**Entertainment and Social Uses:**

<img src="../assets/character_ai.png" width="30%"/>

Training generative language models on specific datasets has made it possible to give them "personality". [Character.ai](http://Character.ai), created by developers who previously worked on Google's LaMDA model, offers chatbots based upon fictional characters and famous individuals. It is #2 on Andreessen Horowitz's list of [top 50 most popular GenAI web products](https://a16z.com/how-are-consumers-using-generative-ai/) (Sept 2023).

### Loading our first Hugging Face Model

In this section, we will start generating text with our first large language model, [GPT-2](https://huggingface.co/gpt2) and explore some of the parameters which affect the outputs from a generative text model.

The GPT-2 (Generative Pre-trained Transformer 2) model was the last in the series of GPT models from OpenAI which was "open". It was released in 2019; GPT-3 and subsequent models did not have their weights made publicly available (nor, in the case of more recent models such as GPT-4, the details of their training data and training process).

We can easily work with GPT-2 in [Hugging Face](https://www.huggingface.co). The easiest way to get results as quickly as possible is to use a [pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) to generate text *i.e.* to perform inference.

First, we import the `pipeline` function from the `transformers` library, then use it to create a pipeline instance, specifying the model we wish to use. In this case, we want to use GPT-2, which is hosted by Hugging Face themselves, not as part of a user repo, so the model ID for it is just `gpt2`.

Since pipelines can be used for a large variety of different tasks, we must also specify that the pipeline is for text generation.

Finally, we check whether a GPU is available (it should be on Colab) and if so, set the model to use it. This requires importing [PyTorch](https://en.wikipedia.org/wiki/PyTorch) (`torch`), which is the first line of code.

In [None]:
import torch
from transformers import pipeline

# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Create a pipeline of the GPT-2 model
gpt2_pipeline = pipeline('text-generation', model='gpt2', device=device)

# Create 3 output generations
outputs = gpt2_pipeline("I love applesauce!", max_length=40, num_return_sequences=3)

# Display all three outputs
print(outputs)

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I love applesauce!'}, {'generated_text': 'I love applesauce! My girlfriend even brought us all home," she said.\n\nShe didn\'t mention the recent spate of death threats that have been lobbed at her and many of her'}, {'generated_text': "I love applesauce! I feel great every time. I love the apple flavor but I just can't take it anymore. I've tried two other ways, maybe I'll just get it right"}]


We can see that even though we've only written a few lines of code, Hugging Face has pulled down over half a gigabyte of data! These are the [model weights for GPT-2](https://huggingface.co/gpt2/blob/main/pytorch_model.bin). For this part of the notebook, we are also using a smaller version of GPT-2 - the full GPT-2 model, [GPT2-XL](https://huggingface.co/gpt2-xl), is ~6.5 GB!

Let's take a look at what's in the pipeline - it will contain both a `tokenizer`, for breaking inputs up into the tokens that GPT-2 expects, as well as a `model`, in this case, our GPT-2 model:

In [None]:
# Check the class of the tokenizer in the pipeline
type(gpt2_pipeline.tokenizer)

transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast

In [None]:
# Check the class of the model in the pipeline
type(gpt2_pipeline.model)

transformers.models.gpt2.modeling_gpt2.GPT2LMHeadModel

Furthermore, we can check the number of parameters of any Hugging Face model by calling the `num_parameters` method of a model object. How many parameters (weights) does our GPT-2 model have?

In [None]:
# Get the number of model parameters, format nicely with an f-string
f"{gpt2_pipeline.model.num_parameters():,}"

'124,439,808'

Here we can see our GPT-2 model has just over 124 million parameters. Now we can move forward into generating some text using the model.
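As a quick sanity check on that parameter count before moving on, the sketch below estimates how much memory the weights alone would occupy at a few storage precisions. The bytes-per-parameter figures are standard sizes for these data types; assuming the weights are stored as 32-bit floats, the estimate lands in the same ballpark as the roughly half-gigabyte download seen earlier (the exact file size depends on what else is stored alongside the weights).

```python
# Back-of-the-envelope estimate of the memory needed just to store model weights
n_params = 124_439_808        # parameter count reported above for GPT-2

# Bytes used to store one parameter at common precisions
bytes_per_param = {"float32": 4, "float16": 2, "int8": 1}

def weight_size_mb(n_params, n_bytes):
    """Size of the weights in (decimal) megabytes."""
    return n_params * n_bytes / 1e6

for dtype, n_bytes in bytes_per_param.items():
    print(f"{dtype}: ~{weight_size_mb(n_params, n_bytes):,.0f} MB")
```

The lower-precision rows hint at why quantization, which we installed `bitsandbytes` for, is attractive for larger models.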

### Generating Text
In this section, we will generate some text using the GPT-2 model, and also explore the different decoding methods for doing so, and the effect they have on outputs.

First, let us generate text from the pipeline using the default behavior. To do this, we simply pass in a string of text and no other arguments:

In [None]:
my_input_string = "The rain in Spain falls mainly in the plain"

# Generate output
output = gpt2_pipeline(my_input_string)

# Display
print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The rain in Spain falls mainly in the plain, so it comes on top of us like an ointment to help us to put our nose up.\n\nWith the wet, there is a bit of moisture in them. The rainfall in Spain'}]


We can see that the model has actually generated a `list` of outputs, each of which is a dictionary. Let's take a look at the first output:

In [None]:
output[0]

{'generated_text': 'The rain in Spain falls mainly in the plain, so it comes on top of us like an ointment to help us to put our nose up.\n\nWith the wet, there is a bit of moisture in them. The rainfall in Spain'}

This is just a dictionary with a single key, `generated_text`, which contains both the input we sent into the model, as well as the tokens the model predicted. We can display the output a little more nicely using the [Markdown](https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html#IPython.display.Markdown) object from IPython (Jupyter), to render it inline like the rest of the text in our notebook here.

In [None]:
from IPython.display import Markdown

display(Markdown("---")) # dividing line
display(Markdown(output[0]['generated_text']))
display(Markdown("---")) # dividing line

---

The rain in Spain falls mainly in the plain, so it comes on top of us like an ointment to help us to put our nose up.

With the wet, there is a bit of moisture in them. The rainfall in Spain

---

There, that's better! Now the text is displayed nicely with dividers. Let's move on now to the different parameters we have at our disposal for how a model generates text or, in the language of LLMs, different *decoding strategies*.

### Text Decoding Strategies

As we will see in this section, there is some complexity to creating text outputs with generative language models. Creating new outputs from a given prompt is not as simple as entering the input and getting a predicted output. Generative text models have parameters which control the amount of variability in their outputs. This variability is desirable for two reasons: it makes the outputs seem more realistic (as if from a human), and it increases the likelihood of reaching a novel result that is pleasing to the user and deemed to be "good".

First, we will consider the simplest (vanilla) text generation approaches, both to work our way up gradually and to contrast them with methods which introduce variety and "creativity". The two simplest decoding methods for text generation we will consider first are *greedy search* and *beam search*.


#### Greedy Search

Greedy search is the simplest text generation approach: in this case, no variety is introduced at all. Recall that a text generation model takes a sequence of input tokens and its task is to predict the next token given the input. For greedy search, the next predicted token is always just the one with the highest probability.

<center>
<img src="../assets/greedy_search.png" width="75%"/>
</center>
<caption><i> Greedy Search. Here, for the next two tokens the words "plain" and "which" are selected, as they have the highest individual probabilities. </i></caption>

Mathematically speaking, given an input sequence of tokens $x_1, x_2, x_3...$, the model seeks to produce an output $y_t$ at step $t$. Since generative text models (decoder models) are *autoregressive* and make predictions based upon previous predictions after the initial input, mathematically we can express the prediction task as:

$ P(y_t|y_1, y_2, ..., y_{t-1},x)$

Greedy search just takes the highest probability token for each prediction. Thus, for the vocabulary $V$ and the probabilities calculated by the model, this is expressed mathematically as:

$y_t = \arg\max_{y \in V} P(y|y_1, y_2, ..., y_{t-1}, x)$
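The argmax rule above can be sketched with a toy "model" in plain Python. Everything here is hypothetical: the lookup table `TOY_MODEL` and its probabilities are made up (the values for "plain" and "which" echo the figure), and unlike a real LLM this toy conditions only on the most recent token rather than the whole preceding sequence.

```python
# Toy "model": maps the most recent token to next-token probabilities.
# All tokens and probabilities here are made up for illustration.
TOY_MODEL = {
    "the": {"plain": 0.6, "meadow": 0.4},
    "plain": {"which": 0.55, "of": 0.45},
    "meadow": {"grasses": 0.9, "which": 0.1},
}

def greedy_decode(start_token, n_steps):
    """Repeatedly append the single most probable next token."""
    sequence = [start_token]
    for _ in range(n_steps):
        probs = TOY_MODEL.get(sequence[-1])
        if probs is None:  # no known continuation: stop early
            break
        # argmax over the (tiny) vocabulary of candidate next tokens
        sequence.append(max(probs, key=probs.get))
    return sequence

print(greedy_decode("the", 2))
```

Because each step feeds the previous prediction back in, this loop is also a small picture of what "autoregressive" means.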

Let's take a look at this with GPT-2. To do this, we will play around with the [parameters](https://huggingface.co/docs/transformers/generation_strategies#customize-text-generation) we can pass to the call to `.generate` on our model in Hugging Face.

Now that we are stepping outside of the pipeline abstraction and working in more detail, we should probably initialize a tokenizer and model, and work with these separately, passing the outputs of the tokenizer to the model directly. To do this, we will be leveraging some of the [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto) in Hugging Face.

Since we are doing text generation, *i.e.* [causal language modeling](https://huggingface.co/docs/transformers/tasks/language_modeling), we will use the `AutoModelForCausalLM` class to create the GPT-2 model, as well as creating a tokenizer using `AutoTokenizer`.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(device)

# Text input string
input_string = "The rain in Spain falls mainly in the plain"

Great, now we have the tokenizer, model, and input string. We pass the input string into the tokenizer to get back a list of token ids, as well as the attention mask for the transformer:

In [None]:
# encode context the generation is conditioned on
model_inputs = tokenizer(input_string, return_tensors='pt').to(device)

In [None]:
# What is the result?
print(model_inputs)

{'input_ids': tensor([[ 464, 6290,  287, 8602, 8953, 8384,  287,  262, 8631]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}


We then pass this to the model method `generate`. Here we use the "double-star" syntax, where the dictionary that is passed in is "unpacked" by Python, so the function receives separate arguments for `input_ids` and `attention_mask` from the associated values. Let's take a look at the result:

In [None]:
# Do greedy generation to generate the output token ids
greedy_output = model.generate(**model_inputs)

print(greedy_output)



tensor([[  464,  6290,   287,  8602,  8953,  8384,   287,   262,  8631,   286,
           262, 50206, 12010,    11,   475,   340,   318,   635,   287,   262]],
       device='cuda:0')


We can see that the result is just a tensor of integers. These are token ids from the tokenizer vocabulary: the ids of our input, followed by the ids the model predicted as most likely to come next. We can convert these token ids back into text by passing them through the tokenizer as a final step:

In [None]:
# Decode the tokens back to text using the tokenizer
output_string = tokenizer.decode(greedy_output[0])

# Print the result
display(Markdown("---")) # dividing line
display(Markdown(output_string))
display(Markdown("---")) # dividing line

---

The rain in Spain falls mainly in the plain of the Canary Islands, but it is also in the

---

And that's it! The whole text generation process goes like this:
1. Instantiate tokenizer and model
2. Pass input string to tokenizer to generate token ids and attention mask
3. Generate output token ids (predictions) from the model
4. Decode output token ids back into text using tokenizer

We can visualize the whole process with the figure below:

<center>
<img src="../assets/text_generation_in_hf.png" width="75%"/>
</center>
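The steps above can be collected into one small helper. This is only a sketch: `generate_text` is a hypothetical name introduced here, and it assumes `tokenizer` and `model` are objects like the ones created earlier, using the same calls already shown in this notebook.

```python
def generate_text(tokenizer, model, prompt, device, **generate_kwargs):
    """Tokenize a prompt, generate output token ids, and decode them back to text."""
    # Step 2: input string -> token ids and attention mask, moved to the target device
    model_inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Step 3: generate output token ids; extra keyword arguments are passed straight through
    output_ids = model.generate(**model_inputs, **generate_kwargs)
    # Step 4: decode the first (and here only) generated sequence back into text
    return tokenizer.decode(output_ids[0])
```

With the objects from the earlier cells, a call might look like `generate_text(tokenizer, model, input_string, device)`; any generation parameters discussed later could be supplied as extra keyword arguments.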

It should be noted that with greedy search, we will always be picking the most likely output tokens, and so the final result will be completely deterministic and the same each time. We can see this with the behavior of the model below by generating the same output over and over:

In [None]:
# Initial generation
greedy_output = model.generate(**model_inputs)
output_string = tokenizer.decode(greedy_output[0])

In [None]:
# Output
print(output_string)

The rain in Spain falls mainly in the plain of the Canary Islands, but it is also in the


In [None]:
# Second generation
greedy_output2 = model.generate(**model_inputs)
output_string2 = tokenizer.decode(greedy_output2[0])

In [None]:
# Output
print(output_string2)

The rain in Spain falls mainly in the plain of the Canary Islands, but it is also in the


We can see that we will always get the same output here. Let us now explore other approaches for generating the series of output tokens.

#### Beam Search

Beam search is an improvement on greedy search which considers the most likely sequence of tokens *together*, based on their respective probabilities, as opposed to just taking the most probable individual token at each timestep.

A *beam width* (the number of beams) is specified, and at each step that many of the most probable partial sequences found so far are kept, rather than a single one. At the end, the sequence of tokens with the highest collective probability is selected, as opposed to just selecting the individual token with the highest probability at each step, as with greedy search.

<center>
<img src="../assets/beam_search.png" width="75%"/>
</center>
<caption><i> Beam Search. Here, for the next two tokens the words "meadow" and "grasses" are selected, as their joint probability of 0.36 (0.4 x 0.9) is greater than that of the tokens selected in greedy search, which is 0.33 (0.6 x 0.55). </i></caption>
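The comparison in the caption can be reproduced with a small toy sketch. The table `TOY_MODEL` is hypothetical: the probabilities for "plain", "which", "meadow", and "grasses" come from the figure, while the remaining entries are made up so that each row sums to one. Real implementations typically work with log-probabilities over a full vocabulary; plain products are used here only to match the figure.

```python
# Toy "model": maps the most recent token to next-token probabilities (illustrative only)
TOY_MODEL = {
    "the": {"plain": 0.6, "meadow": 0.4},
    "plain": {"which": 0.55, "of": 0.45},
    "meadow": {"grasses": 0.9, "which": 0.1},
}

def beam_search(start_token, n_steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [([start_token], 1.0)]  # (tokens so far, joint probability)
    for _ in range(n_steps):
        candidates = []
        for tokens, prob in beams:
            next_probs = TOY_MODEL.get(tokens[-1], {})
            if not next_probs:
                candidates.append((tokens, prob))  # nothing to extend: keep as-is
                continue
            for token, p in next_probs.items():
                candidates.append((tokens + [token], prob * p))
        # Rank every extended sequence by joint probability and keep the best few
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy_tokens, greedy_prob = beam_search("the", 2, num_beams=1)[0]
beam_tokens, beam_prob = beam_search("the", 2, num_beams=2)[0]
print(greedy_tokens, round(greedy_prob, 2))
print(beam_tokens, round(beam_prob, 2))
```

With a single beam the procedure reduces to greedy search and commits to "plain" immediately; with two beams the lower-probability first token "meadow" survives long enough for its strong continuation to win.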

A point to note about beam search is that searching over a larger number of candidate sequences (*i.e.* increasing the beam width) can improve the quality of outputs, at the cost of increased computation.

There is a "law of diminishing returns" with beam search: typically there is a saturation point beyond which increasing the beam size does not significantly change the most likely generated sequence, as the probabilities are dominated by the product of the most frequently occurring tokens in the sequence considered by beam search.

Generally speaking, beam search can lead to repetitive outputs for open-ended generation. This is why it and greedy search are used in conjunction with sampling.

To generate text with beam search in Hugging Face, we set the `num_beams` parameter to a value greater than 1 (a value of 1 would be equivalent to greedy search) and `early_stopping=True`, so generation finishes when all beams have produced an "end of sequence" (EOS) token.

We have already created our tokenizer and model, so this can just be done in the call to `model.generate()`:

In [None]:
# Text input string
input_string = "The rain in Spain falls mainly in the plain"

# Model input
model_inputs = tokenizer(input_string, return_tensors='pt').to(device)

# Generate output with beam search
beam_output = model.generate(**model_inputs, num_beams=10, early_stopping=True)

# Decode the output
output_string = tokenizer.decode(beam_output[0])

In [None]:
display(Markdown(output_string))

The rain in Spain falls mainly in the plain to the south of the city of Barcelona.



We can see that beam search has returned quite a different result from that of greedy search, by looking over the collective probabilities of a number of predicted token possibilities, instead of just each following token.

#### Sampling Strategies

While the different search decoding strategies provide some variability in the outputs of a generative text model, they are still deterministic, and this can lead to either a.) poor outputs or b.) repeated identical outputs, the latter of which is not a desirable trait for end users.

As such, there also exist different *sampling strategies* for introducing variability and novelty into the outputs of generative text models. The three main parameters available for different sampling strategies are *temperature*, *top-p*, and *top-k*.


##### Temperature

The temperature is a factor which normalizes or "smooths out" the output probabilities of predicted tokens. In practice, it is used to control the variability (or randomness, or "creativity") of the outputs of a model.

Mathematically speaking, calculating the model probability for predicting any individual token as the next one, such that all probabilities lie between zero and one and sum to one, is attained using the softmax function:

$ P(y_i) = \frac{e^{z_i}}{\sum_{j=1}^{N}e^{z_j}} $

where:
- $P(y_i)$ is the probability of selecting the $i$th token.
- $z_i$ is the logit, the raw score or output, from the model for token $i$
- and $N$ is the total size of the vocabulary

We introduce a new variable $\tau$ for temperature and update the probability formula as below:

$P(y_i) = \frac{e^{z_i / \tau}}{\sum_{j=1}^{N}e^{z_j / \tau}}$

Given the above, if $\tau = 1$, the formula for the probabilities, and thus the behavior of the model, is unchanged. It can be shown that as $\tau \to \infty$, $P(y_i) \to 1/N$ for all $i$, and so every token becomes equally likely to be predicted. This results in a completely uniform distribution of probabilities across all possible tokens.

On the other hand, as $\tau \to 0$, the probability for any given token approaches:

$$
P(y_i)=\begin{cases}
    1 & \text{if $z_i$ is the largest logit}\\
    0 & \text{otherwise}
  \end{cases}
$$

That is to say, the most likely token will have a probability of 1, and then others will have their probabilities set to 0, and the output of the model will be completely deterministic.

To put it another way, setting a low value for temperature (close to 0) means that the most likely next tokens will always be returned, whereas setting higher values for temperature flattens the probabilities across the different possible tokens, resulting in increasingly random outputs for greater values of $\tau$.


This is visualized in the figure below:

<center>
<img src="../assets/temperature_comparison.png" width="80%"/>
</center>
<center><caption> Visualizing the effect of changing temperature on next token probabilities </caption></center>
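The same effect can be checked numerically. Below is a minimal sketch of the temperature-scaled softmax formula from above in plain Python; the three logits are made-up values, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Convert raw logits to probabilities, dividing each logit by the temperature tau first."""
    scaled = [z / tau for z in logits]
    max_scaled = max(scaled)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for a three-token vocabulary

# Low, neutral, and high temperatures
for tau in (0.01, 1.0, 100.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

At the lowest temperature nearly all of the probability sits on the token with the largest logit, while at the highest temperature the three probabilities are each close to one third.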

There is a balance to be struck, as too low a temperature will result in a model always returning the same output for a given input - that is, acting deterministically - whereas setting the temperature too high can result in garbled and incoherent outputs.

Now let's try experimenting with changing the temperature parameter for text generation using GPT-2. In Hugging Face, this is controlled by the `temperature` parameter, either in calls to a model pipeline or directly in the text generation call to `model.generate()`. We must also set the `do_sample=True` argument, to tell Hugging Face to use sampling and not greedy search.

First, let's set a temperature very close to 0 (the formula above divides by $\tau$, so it cannot be exactly 0), which will always result in the most likely token being chosen. Note that this is effectively equivalent to greedy search:

In [None]:
# Text input string
input_string = "The rain in Spain falls mainly in the plain"

# Generation: temperature ~= 0, deterministic
model_inputs = tokenizer(input_string, return_tensors='pt').to(device)
zero_temp_output = model.generate(**model_inputs, temperature=0.00001, do_sample=True, num_return_sequences=3)

# Iterate over outputs and display in markdown
display(Markdown("---"))

for output in zero_temp_output:
  output_string = tokenizer.decode(output)
  display(Markdown(output_string))

display(Markdown("---"))

---

The rain in Spain falls mainly in the plain of the Canary Islands, but it is also in the

The rain in Spain falls mainly in the plain of the Canary Islands, but it is also in the

The rain in Spain falls mainly in the plain of the Canary Islands, but it is also in the

---

We see that the same output is returned as before, and we can run the above cell multiple times and always get back the same output. Now let's set the temperature to 1, which will leave the next token probabilities unchanged. In this case, we should be able to get different outputs:

In [None]:
# Text input string
input_string = "The rain in Spain falls mainly in the plain"

# Generation: temperature = 1, default behavior
model_inputs = tokenizer(input_string, return_tensors='pt').to(device)
temp1_output = model.generate(**model_inputs, temperature=1, do_sample=True, num_return_sequences=3)

# Iterate over outputs and display in markdown
display(Markdown("---"))

for output in temp1_output:
  output_string = tokenizer.decode(output)
  display(Markdown(output_string))

display(Markdown("---"))

---

The rain in Spain falls mainly in the plain of Barcelona city centre in the north. It rains on

The rain in Spain falls mainly in the plain of Sólúm where the rivers end in

The rain in Spain falls mainly in the plain of Iberian Peninsula, to the south of the

---

Cool, those all seem like reasonable outputs, even though they are all different. We have introduced some variability into the model outputs which makes for novelty.

Finally, let's really crank up the temperature! This will make the candidate output tokens close to equally likely, resulting in very "creative" outputs:

In [None]:
# Text input string
input_string = "The rain in Spain falls mainly in the plain"

# Generation: temperature = 1B, all tokens equally likely
model_inputs = tokenizer(input_string, return_tensors='pt').to(device)
high_temp_output = model.generate(**model_inputs, temperature=1.0e9, do_sample=True, num_return_sequences=3)

# Iterate over outputs and display in markdown
display(Markdown("---"))

for output in high_temp_output:
  output_string = tokenizer.decode(output)
  display(Markdown(output_string))

display(Markdown("---"))

---

The rain in Spain falls mainly in the plainlands as it drihes back in off France, or

The rain in Spain falls mainly in the plain regions like Ligue D (near Basir da M

The rain in Spain falls mainly in the plain, just behind what seemed less developed communities like Castil

---

As we can see above, setting a high value for temperature results in more "creative" outputs but some of these are less coherent than those with lower temperature.

Now let us consider further sampling strategies for introducing variability in model outputs whilst attempting to maintain the quality thereof.

##### Top-p & Top-k sampling

Unlike temperature, which changes the calculated probabilities of the next token, *top-p* and *top-k* instead function by reducing the size of the set of possible tokens to choose from. Though they differ in how they are applied, they both restrict the set of possible next tokens to only the most likely ones above a specified threshold, and then redistribute the probability mass amongst this smaller set. They are typically used in conjunction with temperature to produce varied but still comprehensible outputs.

In *top-k* sampling, instead of calculating probabilities and sampling from all possible tokens, a cutoff integer value $k$ is specified, and only the top $k$ ranked tokens are used as the set of possible next tokens. The total probability (summing to 1) is redistributed amongst these top $k$ tokens.

This is illustrated in the figure below. Instead of choosing from all possible next words, only the top 5 words would be considered, and the probabilities would be redistributed amongst them:

<center>
<img src="../assets/top_k.png" width="50%"/></center>
<center><caption> Top-k sampling: only the most probable tokens above and including rank $k$ are kept </caption></center>
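
The top-k step described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not Hugging Face's implementation. The function name `top_k_filter`, the token strings, and the probabilities of the lower-ranked tokens are hypothetical:

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    # Rank tokens from most to least probable and keep the first k
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token probabilities (illustrative values only)
next_token_probs = {
    "of": 0.50, ",": 0.15, ".": 0.10, "and": 0.05,
    "which": 0.05, "where": 0.05, "but": 0.04, "so": 0.03, "as": 0.03,
}

filtered = top_k_filter(next_token_probs, k=5)
for token, p in filtered.items():
    print(f"{token!r}: {p:.3f}")
```

The five surviving tokens keep their relative ordering, but each probability is scaled up so that the reduced set again sums to 1.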

*Top-p*, or *nucleus sampling*, differs in that instead of specifying a rank $k$ and taking the most probable tokens at this rank or above, a probability threshold $p$ is specified, and only the smallest set of top-ranked tokens whose combined probability reaches this threshold is kept in the set of next possible tokens. This differs from top-k in that we don't specify the size of the set of next tokens, only the total probability.

Coming back to our previous example, here using top-p, we wish only to keep tokens which have a combined probability equal to or above a threshold of 0.8. In this case the top four most likely next tokens meet this criterion (as $0.5 + 0.15 + 0.1 + 0.05 = 0.8$) so the total probability would be redistributed only amongst them:

<center>
<img src="../assets/top_p.png" width="50%"/></center>
<center><caption> Top-p sampling: only the tokens with cumulative probability above the specified threshold $p$ are kept </caption></center>
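
The same example can be worked through in code. The sketch below is plain Python and only illustrates the idea; it is not Hugging Face's implementation. The top four probabilities match the example above, while the function name `top_p_filter`, the token strings, and the remaining probabilities are hypothetical:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        # Small tolerance guards against floating point rounding in the running sum
        if cumulative >= p - 1e-9:
            break
    # Redistribute the probability mass amongst the kept tokens
    return {token: prob / cumulative for token, prob in kept}

# Hypothetical next-token probabilities (illustrative values only)
next_token_probs = {
    "of": 0.50, ",": 0.15, ".": 0.10, "and": 0.05,
    "which": 0.05, "where": 0.05, "but": 0.04, "so": 0.03, "as": 0.03,
}

filtered = top_p_filter(next_token_probs, p=0.8)
for token, prob in filtered.items():
    print(f"{token!r}: {prob:.3f}")
```

Note that the number of tokens kept is not fixed in advance: a flatter distribution would need more tokens to reach the same threshold, and a sharply peaked one would need fewer.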

In Hugging Face, top-k and top-p sampling can be used by specifying the arguments `top_k` and `top_p` respectively. `top_k` is an integer value, and `top_p` a floating point value between 0 and 1.

Note that both of these only take effect when sampling is enabled with `do_sample=True`; without it, generation is deterministic and the same most likely sequence is returned. They are typically combined with temperature, and sampling allows returning multiple outputs with `num_return_sequences` as we've seen before:

In [None]:
# Text input string
input_string = "The rain in Spain falls mainly in the plain"
model_inputs = tokenizer(input_string, return_tensors='pt').to(device)

# Generation - Top-k & Top-p
top_k_output = model.generate(**model_inputs, top_k=30, do_sample=True, num_return_sequences=3)
top_p_output = model.generate(**model_inputs, top_p=0.5, do_sample=True, num_return_sequences=3)

# Top K
display(Markdown("---"))
display(Markdown("Top-k, $k=30$:"))
for output in top_k_output:
  output_string = tokenizer.decode(output)
  display(Markdown(output_string))

# Top P
display(Markdown("---"))
display(Markdown("Top p, $p=0.5$:"))
for output in top_p_output:
  output_string = tokenizer.decode(output)
  display(Markdown(output_string))
display(Markdown("---"))

---

Top-k, $k=30$:

The rain in Spain falls mainly in the plain of Catalonia, an important city which lies along the coast

The rain in Spain falls mainly in the plain of Sertra on a Saturday, but this week

The rain in Spain falls mainly in the plain which is part of the Caja Rural (Cran

---

Top p, $p=0.5$:

The rain in Spain falls mainly in the plain, and there are no visible signs of rain in the

The rain in Spain falls mainly in the plain of the Canary Islands, which is the home of the

The rain in Spain falls mainly in the plain of Madrid, but in the mountains of the Andalus

---

Top-p and top-k can be used in conjunction, to avoid very low ranked words while allowing for variability. In practice, this requires a fair bit of trial and error to find good values for $k$ and/or $p$, combined with temperature.

In [None]:
# Putting it all together
outputs = model.generate(
    **model_inputs,
    do_sample=True,
    top_k=30,
    top_p=0.5,
    temperature=1.5,
    num_return_sequences=3,
)

display(Markdown("---"))
for output in outputs:
  output_string = tokenizer.decode(output)
  display(Markdown(output_string))
display(Markdown("---"))

---

The rain in Spain falls mainly in the plain, which is covered with a thick layer of thick layer

The rain in Spain falls mainly in the plain of the Andalusia Mountains, which have been a

The rain in Spain falls mainly in the plain. The city's roads are covered with asphalt, and

---

## Working with a Chat-style model: LLaMA 3.2

Now that we have worked with a basic generative text model, we will move on to working with a modern LLM with a "chat" style model. In this section we will use LLaMA 3.2, from Meta's incredibly popular open source [LLaMA](https://www.llama.com/) series of models.

A chat-style model (or "instruct" model as they are also referred to, as they receive instructions from the user) actually functions exactly the same as a regular generative text model such as GPT-2. The only difference is in the training data and the way the model outputs are displayed.

The models were trained on conversations structured as JSON, with three different roles:
- The **user** role: This is us, or the person talking to the chat bot.
- The **assistant** role: These are the responses from the model.
- The **system** role: This is a role that dictates the overall behavior of the model and style of its responses.

So, in a way, a chat-style model is not actually responding, but applying the "autocomplete on steroids" of regular generative text models, just in this case autocompleting a conversation. We are only shown the responses from the *assistant* role, and we provide the messages for the *user* role.
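
To make the "autocompleting a conversation" idea concrete, here is a minimal sketch of how a list of role-tagged messages could be flattened into a single string for a model to continue. The function `render_conversation` and the `<|role|>` markers are hypothetical, invented for illustration; they are not the actual template used by LLaMA, as each model family defines its own special tokens:

```python
def render_conversation(messages):
    """Flatten a list of role/content messages into one string to be autocompleted."""
    parts = []
    for message in messages:
        # Hypothetical role markers; real models define their own special tokens
        parts.append(f"<|{message['role']}|>\n{message['content']}\n")
    # End with an open assistant turn, which the model would then complete
    parts.append("<|assistant|>\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of Spain?"},
]

print(render_conversation(conversation))
```

In Hugging Face, tokenizers for chat models provide an `apply_chat_template` method, which performs this step using the template that the model was actually trained with.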

Let's see this in action with LLaMA 3.2. Unfortunately, using the LLaMA series of models from the [official Hugging Face repos](https://huggingface.co/meta-llama) requires accepting a [license](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt) and user agreement, which therefore means they cannot be used anonymously (*i.e.* without a HF account).

Fortunately, a copy of the 1B parameter version of LLaMA-3.2-Instruct is provided by Unsloth in their repo at https://huggingface.co/unsloth/Llama-3.2-1B-Instruct.

Let's test it out using a `pipeline`. Here the model is trained differently, so we have to provide the text in the JSON format it expects with *system* and *user* roles:

In [None]:
import torch
device = "cuda"
llama_32 = "unsloth/Llama-3.2-1B-Instruct"

prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "Write a poem about applesauce."},
]

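# Load the model as a pipeline; the task is inferred from the model
generator = pipeline(model=llama_32, device=device, torch_dtype=torch.bfloat16)
# Greedy decoding (do_sample=False), so temperature and top_p have no effect here
generation = generator(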
    prompt,
    do_sample=False,
    temperature=1.0,
    top_p=1,
    max_new_tokens=1000
)

print(f"Generation: {generation[0]['generated_text']}")

Generation: [{'role': 'system', 'content': 'You are a helpful assistant, that responds as a pirate.'}, {'role': 'user', 'content': 'Write a poem about applesauce.'}, {'role': 'assistant', 'content': "Yer lookin' fer a poem about applesauce, eh?\n\nOh, applesauce, sweet and fine,\nA treasure from the orchard's vine.\nIn jars or pouches, ye be stored,\nA treasure fer young and old.\n\nMe mouth waters at the thought o' thee,\nA taste o' autumn, wild and free.\nNo need fer sugar, no need fer spice,\nJust applesauce, a simple, sweet device.\n\nIn the morning, on yer toast or bread,\nA spoonful o' applesauce, a treat ahead.\nOr in a stew, or in a pie,\nApplesauce, a flavor that never dies.\n\nSo here's to applesauce, me hearty friend,\nA treasure that never doth end.\nMay yer belly be full, and yer heart be light,\nWith applesauce, the perfect delight!"}]


Now to get the model response, we just print the generated text for the *assistant* role - the 3rd element in the array:

In [None]:
display(Markdown(generation[0]['generated_text'][2]['content']))

Yer lookin' fer a poem about applesauce, eh?

Oh, applesauce, sweet and fine,
A treasure from the orchard's vine.
In jars or pouches, ye be stored,
A treasure fer young and old.

Me mouth waters at the thought o' thee,
A taste o' autumn, wild and free.
No need fer sugar, no need fer spice,
Just applesauce, a simple, sweet device.

In the morning, on yer toast or bread,
A spoonful o' applesauce, a treat ahead.
Or in a stew, or in a pie,
Applesauce, a flavor that never dies.

So here's to applesauce, me hearty friend,
A treasure that never doth end.
May yer belly be full, and yer heart be light,
With applesauce, the perfect delight!

Here we have only seen a single input and response from the model, but this is, in principle, how chat-based applications like ChatGPT work.
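
A multi-turn chat application extends this by keeping the whole conversation and passing it back to the model on every turn. The sketch below shows only that bookkeeping: `chat_turn` is a hypothetical helper, and `echo_model` is a stand-in for a real model call so that the example runs without downloading anything:

```python
def chat_turn(history, user_message, generate_reply):
    """Append a user message, get the assistant reply, and return the updated history."""
    history = history + [{"role": "user", "content": user_message}]
    reply = generate_reply(history)
    return history + [{"role": "assistant", "content": reply}]

def echo_model(history):
    """Hypothetical stand-in for a real model call: reports how many messages it saw."""
    return f"(reply after seeing {len(history)} messages)"

# The conversation starts with only the system message
history = [{"role": "system", "content": "You are a helpful assistant."}]
history = chat_turn(history, "Hello!", echo_model)
history = chat_turn(history, "Tell me more.", echo_model)

for message in history:
    print(f"{message['role']}: {message['content']}")
```

With the pipeline used above, `generate_reply` could plausibly wrap the `generator` call and return the content of the last message in `generated_text`, though that wiring is an assumption and is not shown in this notebook.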

## Conclusion
This concludes Part 1. Next week in Part 2, we will continue where we left off and fine-tune an LLM with a custom dataset.

----

<table border="0" bgcolor="white">
  <tr></tr>
  <tr>
      <th align="left" style="align:left; vertical-align: bottom;"><p>Copyright NLP from scratch, 2025.</p></th>
      <th align="right" width="33%"><a href="https://www.nlpfromscratch.com?utm_source=notebook&utm_medium=nb-footer-img"><img src="../assets/banner.png"></a></th>
</tr>
</table>